Clustered data

Below is a summary of the k-means clustering results. The $cluster vector assigns each of the 427 observations to cluster 1 or 2, and $centers gives the mean "aye", "nay", and "other" values for each cluster. The remaining entries are sum of squares measurements, which can be used to assess the quality of the clustering.

## $cluster
##   [1] 2 2 2 2 1 1 1 2 1 2 1 1 1 1 2 1 2 2 1 1 1 1 2 2 1 2 1 2 1 2 2 2 2 1 1 2 1
##  [38] 1 2 1 2 2 2 2 2 1 2 2 1 2 2 2 2 2 1 1 2 1 2 2 2 2 2 1 1 1 1 2 2 1 2 2 2 2
##  [75] 2 1 1 2 1 1 1 2 1 1 1 1 2 2 2 2 1 1 1 2 2 2 1 2 2 2 2 1 1 2 1 2 1 1 1 1 1
## [112] 1 2 2 2 1 1 2 1 1 2 2 1 1 1 1 1 1 2 2 2 2 2 1 2 1 2 1 1 2 1 2 2 2 2 2 1 1
## [149] 2 2 1 1 1 2 1 2 1 2 1 2 1 1 2 1 2 1 2 1 1 1 2 2 1 2 2 2 2 1 1 1 1 2 1 2 2
## [186] 1 1 1 1 2 1 2 1 2 2 2 1 1 2 2 1 2 2 2 1 2 1 1 1 2 1 1 1 1 2 2 1 1 2 1 2 1
## [223] 2 2 2 1 1 1 1 1 1 2 2 1 1 1 2 2 1 2 2 2 2 1 1 1 1 1 2 2 2 2 1 1 2 2 2 1 1
## [260] 1 2 1 2 1 2 1 2 1 2 1 1 1 1 2 1 1 1 2 2 2 2 2 2 2 1 1 2 2 1 1 1 1 1 2 2 2
## [297] 2 1 2 1 1 1 1 2 1 2 2 1 2 2 1 1 1 2 2 1 2 1 1 2 1 1 2 2 1 1 2 2 1 1 1 2 1
## [334] 2 1 1 1 2 2 2 1 1 1 1 1 1 1 2 1 1 2 2 2 1 1 1 2 1 1 1 1 2 1 2 1 1 2 2 2 2
## [371] 1 1 1 2 1 2 2 2 1 2 1 1 2 1 1 2 1 2 1 1 1 1 1 2 1 2 2 1 2 2 1 1 1 1 1 2 1
## [408] 1 2 2 1 2 1 1 1 1 1 2 1 2 1 2 1 2 2 1 2
## 
## $centers
##         aye      nay     other
## 1 122.56889 106.9956  90.43556
## 2  70.32673 145.6337 104.03960
## 
## $totss
## [1] 589869.9
## 
## $withinss
## [1] 43093.49 77671.01
## 
## $tot.withinss
## [1] 120764.5
## 
## $betweenss
## [1] 469105.4
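
For reference, a summary with this structure is what R's built-in kmeans() returns. The following is a minimal sketch on simulated data, not the code behind this report: the data frame votes_sim, its values, the seed, and the nstart setting are all assumptions made for illustration.

```r
set.seed(42)  # k-means starts from random centers, so fix the seed

# Simulated stand-in for the voting data (hypothetical values):
# 200 rows scattered around two different aye/nay/other profiles.
votes_sim <- data.frame(
  aye   = c(rnorm(100, 120, 15), rnorm(100, 70, 15)),
  nay   = c(rnorm(100, 105, 15), rnorm(100, 145, 15)),
  other = c(rnorm(100, 90, 10),  rnorm(100, 105, 10))
)

# Fit two clusters; nstart = 25 keeps the best of 25 random starts.
fit <- kmeans(votes_sim, centers = 2, nstart = 25)

table(fit$cluster)  # number of observations in each cluster
fit$centers         # mean aye/nay/other per cluster
fit$totss           # total sum of squares
fit$withinss        # within-cluster sum of squares, one value per cluster
fit$tot.withinss    # sum of fit$withinss
fit$betweenss       # fit$totss minus fit$tot.withinss
```

The last three components are linked: the total sum of squares splits into the within-cluster part and the between-cluster part, which is what makes the explained variance ratio meaningful.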

Visualization of Clustering Results

Explained Variance: Quality of Clustering

The variance explained by clustering can be used as a metric to determine the quality of clustering. It is determined using the following equation:

## [1] "explained variance = between sum of squares/total sum of squares"
## [1] 79.52692

Clustering the house voting data by the three categories “aye”, “nay”, and “other” resulted in an explained variance of 79.5%. This is fairly high, but about 20.5% of the total variance in the data remains unexplained by the two clusters.
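
The percentages can be checked directly from the sum of squares values in the clustering summary. This is a small sketch of the arithmetic; the variable names are chosen here for illustration and are not taken from the original analysis code.

```r
# Values copied from the clustering summary in this report.
totss     <- 589869.9
withinss  <- c(43093.49, 77671.01)
betweenss <- 469105.4

# Share of the total variance captured by the split into clusters.
explained <- 100 * betweenss / totss

# Share left inside the clusters (the unexplained part).
unexplained <- 100 * sum(withinss) / totss

round(explained, 1)    # 79.5
round(unexplained, 1)  # 20.5
```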

Visualizing impact of clusters on explained variance

Increasing the number of clusters increases the explained variance, but past the 3rd cluster the returns diminish: each additional cluster makes the model more complex without improving the explained variance significantly.
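
A curve of this kind can be produced by refitting k-means for a range of cluster counts and recording the explained variance each time. The sketch below uses simulated data; votes_sim, the helper explained_variance(), the range of k, and nstart are all assumptions for illustration rather than the settings used for the plot in this report.

```r
set.seed(42)

# Simulated stand-in for the voting data (hypothetical values).
votes_sim <- data.frame(
  aye   = c(rnorm(100, 120, 15), rnorm(100, 70, 15)),
  nay   = c(rnorm(100, 105, 15), rnorm(100, 145, 15)),
  other = c(rnorm(100, 90, 10),  rnorm(100, 105, 10))
)

# Hypothetical helper: explained variance (in percent) for a given k.
explained_variance <- function(data, k) {
  fit <- kmeans(data, centers = k, nstart = 25)
  100 * fit$betweenss / fit$totss
}

k_values <- 1:8
ev <- sapply(k_values, function(k) explained_variance(votes_sim, k))

# The "elbow" is the point where the curve flattens out.
plot(k_values, ev, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Explained variance (%)")
```

With a single cluster the explained variance is zero by construction, so the curve always starts at the origin of the vertical axis; the judgment call is where the gains stop being worth the extra clusters.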

Another way to determine the ideal number of clusters is the NbClust() function, which evaluates a range of cluster counts against multiple validity indices. The recommended number of clusters is the one proposed most frequently.
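
The "most frequent proposal wins" step can be sketched on its own. The helper majority_k() and the example proposals below are hypothetical and exist only to illustrate the tally; they are not output from this analysis.

```r
# Hypothetical helper: given the number of clusters proposed by each index,
# return the value proposed most often (ties resolve to the smaller k).
majority_k <- function(proposed_k) {
  votes <- table(proposed_k)
  as.integer(names(votes)[which.max(votes)])
}

# Made-up proposals for illustration only; these are not the results
# reported in this analysis.
example_proposals <- c(2, 2, 3, 2, 4, 3, 2, 5)

table(example_proposals)       # how many proposals each k received
majority_k(example_proposals)  # the most frequently proposed k

# With the NbClust package, the proposals would come from a fitted object,
# for example (assumed layout: first row of Best.nc holds the proposed k):
# nb <- NbClust::NbClust(data, min.nc = 2, max.nc = 10, method = "kmeans")
# majority_k(nb$Best.nc[1, ])
```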

Elbow vs. NbClust()

Both the elbow and NbClust() methods recommend using 2 clusters. The elbow method shows that going beyond two clusters does not dramatically improve the explained variance despite the added complexity. While NbClust() does not show the associated diminishing returns, it does offer a “majority rule” perspective: from the histogram we can see that 2 clusters received the most votes (12).

3D visualization of data
